
    The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment

    OBJECTIVE: Coronavirus disease 2019 (COVID-19) poses societal challenges that require expeditious data and knowledge sharing. Though organizational clinical data are abundant, these are largely inaccessible to outside researchers. Statistical, machine learning, and causal analyses are most successful with large-scale data beyond what is available in any given organization. Here, we introduce the National COVID Cohort Collaborative (N3C), an open science community focused on analyzing patient-level data from many centers. MATERIALS AND METHODS: The Clinical and Translational Science Award Program and scientific community created N3C to overcome technical, regulatory, policy, and governance barriers to sharing and harmonizing individual-level clinical data. We developed solutions to extract, aggregate, and harmonize data across organizations and data models, and created a secure data enclave to enable efficient, transparent, and reproducible collaborative analytics. RESULTS: Organized in inclusive workstreams, we created legal agreements and governance for organizations and researchers; data extraction scripts to identify and ingest positive, negative, and possible COVID-19 cases; a data quality assurance and harmonization pipeline to create a single harmonized dataset; population of the secure data enclave with data, machine learning, and statistical analytics tools; dissemination mechanisms; and a synthetic data pilot to democratize data access. CONCLUSIONS: The N3C has demonstrated that a multisite collaborative learning health network can overcome barriers to rapidly build a scalable infrastructure incorporating multiorganizational clinical data for COVID-19 analytics. We expect this effort to save lives by enabling rapid collaboration among clinicians, researchers, and data scientists to identify treatments and specialized care and thereby reduce the immediate and long-term impacts of COVID-19.
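    The case-identification step mentioned above can be illustrated with a small sketch. All rules, codes, and field names below are hypothetical stand-ins; the actual N3C phenotype definitions are considerably more detailed and are maintained by the collaborative.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PatientRecord:
    lab_results: List[str] = field(default_factory=list)      # e.g., "SARS-CoV-2 PCR: positive"
    diagnosis_codes: List[str] = field(default_factory=list)  # e.g., ICD-10 code "U07.1"

def classify_covid_status(record: PatientRecord) -> str:
    """Hypothetical rule set: classify a patient as a positive, possible,
    negative, or untested COVID-19 case."""
    labs = [r.lower() for r in record.lab_results]
    if any("sars-cov-2" in r and "positive" in r for r in labs):
        return "positive"
    if "U07.1" in record.diagnosis_codes:   # diagnosis code without lab confirmation
        return "possible"
    if any("sars-cov-2" in r and "negative" in r for r in labs):
        return "negative"
    return "untested"

print(classify_covid_status(PatientRecord(lab_results=["SARS-CoV-2 PCR: Positive"])))
```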

    Better together: Integrating biomedical informatics and healthcare IT operations to create a learning health system during the COVID-19 pandemic

    The growing availability of multi-scale biomedical data sources that can be used to enable research and improve healthcare delivery has brought about what can be described as a healthcare data age. This new era is defined by the explosive growth in bio-molecular, clinical, and population-level data that can be readily accessed by researchers, clinicians, and decision-makers, and utilized for systems-level approaches to hypothesis generation and testing as well as operational decision-making. However, taking full advantage of these unprecedented opportunities requires revisiting the alignment between traditionally academic biomedical informatics (BMI) and operational healthcare information technology (HIT) personnel and activities in academic health systems. While the history of the academic field of BMI includes active engagement in the delivery of operational HIT platforms, in many contemporary settings these efforts have grown distinct. Recent experiences during the COVID-19 pandemic have demonstrated greater coordination of BMI and HIT activities that have allowed organizations to respond to pandemic-related changes more effectively, with demonstrable and positive impact as a result. In this position paper, we discuss the challenges and opportunities associated with driving alignment between BMI and HIT, as viewed from the perspective of a learning healthcare system. In doing so, we hope to illustrate the benefits of coordination between BMI and HIT in terms of the quality, safety, and outcomes of care provided to patients and populations, demonstrating that these two groups can be better together.

    A protocol to evaluate RNA sequencing normalization methods

    Background: RNA sequencing technologies have allowed researchers to gain a better understanding of how the transcriptome affects disease. However, sequencing technologies often unintentionally introduce experimental error into RNA sequencing data. To counteract this, normalization methods are routinely applied with the intent of reducing the non-biologically derived variability inherent in transcriptomic measurements. However, the comparative efficacy of the various normalization techniques has not been tested in a standardized manner. Here we propose tests that evaluate numerous normalization techniques and apply them to a large-scale standard data set. These tests comprise a protocol that allows researchers to measure the amount of non-biological variability that is present in any data set after normalization has been performed, a crucial step in assessing the biological validity of data following normalization. Results: In this study we present two tests to assess the validity of normalization methods applied to a large-scale data set collected for systematic evaluation purposes. We tested various RNASeq normalization procedures and concluded that transcripts per million (TPM) was the best-performing normalization method based on its preservation of biological signal as compared to the other methods tested. Conclusion: Normalization is of vital importance to accurately interpret the results of genomic and transcriptomic experiments. More work, however, needs to be performed to optimize normalization methods for RNASeq data. The present effort helps pave the way for more systematic evaluations of normalization methods across different platforms. With our proposed schema researchers can evaluate their own or future normalization methods to further improve the field of RNASeq normalization.
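    As a concrete illustration of the best-performing method identified above, the following is a minimal sketch of transcripts per million (TPM) normalization; the gene names, lengths, and counts are made up for illustration and are not taken from the study's data set.

```python
import pandas as pd

def tpm_normalize(counts: pd.DataFrame, gene_lengths_kb: pd.Series) -> pd.DataFrame:
    """Convert a genes-x-samples matrix of raw read counts to TPM.

    counts          -- raw read counts (rows = genes, columns = samples)
    gene_lengths_kb -- effective transcript lengths in kilobases
    """
    # Step 1: divide each gene's counts by its length to get reads per kilobase (RPK).
    rpk = counts.div(gene_lengths_kb, axis=0)
    # Step 2: scale each sample so its RPK values sum to one million.
    scaling = rpk.sum(axis=0) / 1e6
    return rpk.div(scaling, axis=1)

# Illustrative usage with made-up counts for three genes in two samples.
counts = pd.DataFrame(
    {"sample_A": [500, 1200, 300], "sample_B": [450, 900, 600]},
    index=["GENE1", "GENE2", "GENE3"],
)
lengths_kb = pd.Series([2.0, 4.0, 1.5], index=counts.index)
print(tpm_normalize(counts, lengths_kb))
```

    Because every sample's TPM values sum to one million, expression estimates become comparable across libraries of different sequencing depth, which is the kind of non-biological variability the proposed tests are designed to measure.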

    Machine learning for modeling the progression of Alzheimer disease dementia using clinical data: A systematic literature review

    OBJECTIVE: Alzheimer disease (AD) is the most common cause of dementia, a syndrome characterized by cognitive impairment severe enough to interfere with activities of daily life. We aimed to conduct a systematic literature review (SLR) of studies that applied machine learning (ML) methods to clinical data derived from electronic health records in order to model risk for progression of AD dementia. MATERIALS AND METHODS: We searched for articles published between January 1, 2010, and May 31, 2020, in PubMed, Scopus, ScienceDirect, IEEE Xplore Digital Library, Association for Computing Machinery Digital Library, and arXiv. We used predefined criteria to select relevant articles and summarized them according to key components of ML analysis such as data characteristics, computational algorithms, and research focus. RESULTS: There has been a considerable rise over the past 5 years in the number of research papers using ML-based analysis for AD dementia modeling. We reviewed 64 relevant articles in our SLR. The results suggest that the majority of existing research has focused on predicting progression of AD dementia using publicly available datasets containing both neuroimaging and clinical data (neurobehavioral status exam scores, patient demographics, neuroimaging data, and laboratory test values). DISCUSSION: Identifying individuals at risk for progression of AD dementia could potentially help to personalize disease management and plan future care. Clinical data consisting of both structured data tables and clinical notes can be effectively used in ML-based approaches to model risk for AD dementia progression. Data sharing and reproducibility of results can enhance the impact, adaptation, and generalizability of this research.

    Extraction of clinical phenotypes for Alzheimer's disease dementia from clinical notes using natural language processing

    OBJECTIVES: There is much interest in utilizing clinical data for developing prediction models for Alzheimer's disease (AD) risk, progression, and outcomes. Existing studies have mostly utilized curated research registries, image analysis, and structured electronic health record (EHR) data. However, much critical information resides in relatively inaccessible unstructured clinical notes within the EHR. MATERIALS AND METHODS: We developed a natural language processing (NLP)-based pipeline to extract AD-related clinical phenotypes, documenting strategies for success and assessing the utility of mining unstructured clinical notes. We evaluated the pipeline against gold-standard manual annotations performed by 2 clinical dementia experts for AD-related clinical phenotypes including medical comorbidities, biomarkers, neurobehavioral test scores, behavioral indicators of cognitive decline, family history, and neuroimaging findings. RESULTS: Documentation rates for each phenotype varied in the structured versus unstructured EHR. Interannotator agreement was high (Cohen's kappa = 0.72-1) and positively correlated with the NLP-based phenotype extraction pipeline's performance (average F1-score = 0.65-0.99) for each phenotype. DISCUSSION: We developed an automated NLP-based pipeline to extract informative phenotypes that may improve the performance of eventual machine learning predictive models for AD. In the process, we examined documentation practices for each phenotype relevant to the care of AD patients and identified factors for success. CONCLUSION: Success of our NLP-based phenotype extraction pipeline depended on domain-specific knowledge and focus on a specific clinical domain instead of maximizing generalizability.
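    The evaluation reported above can be sketched as follows; the binary phenotype labels are invented for illustration, and scikit-learn is assumed only as a convenient way to compute the two metrics named in the abstract (Cohen's kappa and F1-score).

```python
from sklearn.metrics import cohen_kappa_score, f1_score

# Presence/absence of one phenotype (e.g., a neuroimaging finding) across
# ten notes, as judged by two expert annotators and by the NLP pipeline.
annotator_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
nlp_pipeline = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

kappa = cohen_kappa_score(annotator_1, annotator_2)   # inter-annotator agreement
f1 = f1_score(annotator_1, nlp_pipeline)              # pipeline vs. gold standard
print(f"Cohen's kappa = {kappa:.2f}, F1-score = {f1:.2f}")
```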

    OpenSep: A generalizable open source pipeline for SOFA score calculation and Sepsis-3 classification

    EHR-based sepsis research often uses heterogeneous definitions of sepsis, leading to poor generalizability and difficulty in comparing studies to each other. We have developed OpenSep, an open-source pipeline for sepsis phenotyping according to the Sepsis-3 definition, as well as determination of time of sepsis onset and SOFA scores. The Minimal Sepsis Data Model was developed alongside the pipeline to enable it to be executed against diverse sources of electronic health record data. The pipeline's accuracy was validated by applying it to the MIMIC-IV version 1.0 data and comparing sepsis onset and SOFA scores to those produced by the pipeline developed by the curators of MIMIC. We demonstrated high agreement between the two pipelines for both sepsis onsets and SOFA scores; however, the Minimal Sepsis Data Model developed for this work allows our pipeline to be applied more broadly, to data sources beyond MIMIC.
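    The classification rule at the heart of such a pipeline is the Sepsis-3 criterion: suspected infection accompanied by an acute rise in SOFA score of at least 2 points. The sketch below illustrates that rule only; the function, fields, and the 48-hour/24-hour window are illustrative assumptions and are not OpenSep's actual interface or implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class SofaObservation:
    time: datetime
    score: int  # total SOFA score (0-24) at this time

def sepsis3_onset(
    suspected_infection_time: datetime,
    sofa_timeline: List[SofaObservation],
    baseline_sofa: int = 0,
    window_before_h: int = 48,   # assumed lookback window around suspected infection
    window_after_h: int = 24,    # assumed lookahead window
) -> Optional[datetime]:
    """Return the earliest time, within a window around suspected infection,
    at which the SOFA score rises >= 2 points above baseline; None if never."""
    lo = suspected_infection_time - timedelta(hours=window_before_h)
    hi = suspected_infection_time + timedelta(hours=window_after_h)
    for obs in sorted(sofa_timeline, key=lambda o: o.time):
        if lo <= obs.time <= hi and obs.score - baseline_sofa >= 2:
            return obs.time
    return None
```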

    Pattern recognition in lymphoid malignancies using CytoGPS and Mercator

    BACKGROUND: There have been many recent breakthroughs in processing and analyzing large-scale data sets in biomedical informatics. For example, the CytoGPS algorithm has enabled the use of text-based karyotypes by transforming them into a binary model. However, such advances are accompanied by new problems of data sparsity, heterogeneity, and noisiness that are magnified by the large-scale multidimensional nature of the data. To address these problems, we developed the Mercator R package, which processes and visualizes binary biomedical data. We use Mercator to address biomedical questions of cytogenetic patterns relating to lymphoid hematologic malignancies, which include a broad set of leukemias and lymphomas. Karyotype data are one of the most common forms of genetic data collected on lymphoid malignancies, because karyotyping is part of the standard of care in these cancers. RESULTS: In this paper we combine the analytic power of CytoGPS and Mercator to perform a large-scale multidimensional pattern recognition study on 22,741 karyotype samples in 47 different hematologic malignancies obtained from the public Mitelman database. CONCLUSION: Our findings indicate that Mercator was able to identify both known and novel cytogenetic patterns across different lymphoid malignancies, furthering our understanding of the genetics of these diseases.
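    The kind of pattern recognition described above can be sketched in a few lines: karyotypes encoded as binary vectors (as CytoGPS produces) are compared with a Jaccard distance and grouped by hierarchical clustering. This is an illustrative Python analogue on random toy data; Mercator itself is an R package with its own distance metrics and visualizations.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data: 12 samples x 20 binary cytogenetic features (e.g., loss/gain calls).
samples = rng.integers(0, 2, size=(12, 20)).astype(bool)

# Jaccard distance is a natural choice for sparse binary presence/absence data.
dist = pdist(samples, metric="jaccard")
tree = linkage(dist, method="average")
clusters = fcluster(tree, t=3, criterion="maxclust")
print("Cluster assignments:", clusters)
```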

    Transmission dynamics: Data sharing in the COVID-19 era

    Problem: The current coronavirus disease 2019 (COVID-19) pandemic underscores the need for building and sustaining public health data infrastructure to support a rapid local, regional, national, and international response. Despite a historical context of public health crises, data sharing agreements and transactional standards do not uniformly exist between institutions, which hampers the foundational infrastructure needed to meet data sharing and integration needs for the advancement of public health. Approach: There is a growing need to apply population health knowledge with technological solutions to data transfer, integration, and reasoning, to improve health in a broader learning health system ecosystem. To achieve this, data must be combined from healthcare provider organizations, public health departments, and other settings. Public health entities are in a unique position to consume these data; however, most do not yet have the infrastructure required to integrate data sources and apply computable knowledge to combat this pandemic. Outcomes: Herein, we describe lessons learned and a framework to address these needs, which focus on: (a) identifying and filling technology gaps; (b) pursuing collaborative design of data sharing requirements and transmission mechanisms; (c) facilitating cross-domain discussions involving legal and research compliance; and (d) establishing or participating in multi-institutional convening or coordinating activities. Next steps: While by no means a comprehensive evaluation of such issues, we envision that many of our experiences are universal. We hope those elucidated can serve as the catalyst for a robust community-wide dialogue on what steps can and should be taken to ensure that our regional and national health care systems can truly learn, in a rapid manner, so as to respond to this and future emergent public health crises.